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Abstract 


We present a scalable Bayesian model for low- 
rank factorization of massive tensors with binary 
observations. The proposed model has the fol¬ 
lowing key properties: (1) in contrast to the mod¬ 
els based on the logistic or probit likelihood, us¬ 
ing a zero-truncated Poisson likelihood for bi¬ 
nary data allows our model to scale up in the 
number of ones in the tensor, which is espe¬ 
cially appealing for massive but sparse binary 
tensors; (2) side-information in form of binary 
pairwise relationships (e.g., an adjacency net¬ 
work) between objects in any tensor mode can 
also be leveraged, which can be especially use¬ 
ful in “cold-start” settings; and (3) the model ad¬ 
mits simple Bayesian inference via batch, as well 
as online MCMC; the latter allows scaling up 
even for dense binary data (i.e., when the num¬ 
ber of ones in the tensor/network is also mas¬ 
sive). In addition, non-negative factor matrices in 
our model provide easy interpretability, and the 
tensor rank can be inferred from the data. We 
evaluate our model on several large-scale real- 
world binary tensors, achieving excellent compu¬ 
tational scalability, and also demonstrate its use¬ 
fulness in leveraging side-information provided 
in form of mode-network(s). 


1 INTRODUCTION 


With the rece nt surge in multiway, multirelational, o r “ten¬ 
sor" data sets ( Nickel et al. . 2011 : Kang et all 2012 ). learn¬ 
ing algorithms that can extract useful knowledge from such 
data are becomin g increasingly important. Tensor decom¬ 
position methods ( Kolda and Badeii 12009 ) offer an attrac¬ 
tive way to accomplish this task. Among tensor data, of 
particular interest are real-world binary tensors, which are 
now ubiquitous in problems involving social networks, rec- 


* Equal contribution 


ommender systems, and knowledge bases, etc. For in¬ 
stance, in a knowledge base, predicate relations defined 
over the tuples (subjects, objects,verbs) can be represented 
in form of a binary three-way tensor (Kan g et al.. 120121) . 


Usually, real-world binary tensors are massive (each di¬ 
mension can be very large) but extremely sparse (very few 
ones in the tensor). For example, in a recommender system, 
each positive example (e.g., an item selected a set) implic¬ 
itly creates several negative examples (items not chosen). 
Likewise, in a knowledge base, the validity of one rela¬ 
tion automatically implies invalidity of several other rela¬ 
tions. In all these settings, the number of negative examples 
greatly overwhelms the number of positive examples. 


Unfor tunately, b i nary t ensor fa ct orizat io n meth- 
ods (Nickel et al 


20111: IXu et all 120131: iRai et al 


2014), based on probit or logistic likelihood, scale poorly 


for massive binary tensors because these require evaluating 
the likelihood/loss-function on both ones as well as zeros 
in the tensor. One possibility is to use heuristics such 
as undersampling the zeros, but such heuristics usually 
result in less accurat e solutions. Ano t her al t ernative is to 
use th e squared loss ( Hidasi and Tikkl 2012 : Nickel et al 


20121) as the model-fit criterion, which facilitates linear 


scalability in the number of ones in the tensor. How¬ 
ever, such a n ap p roac h can often lead to suboptimal 
results (Ermis and Bouchard, 2014) in practice. 


It is therefore desirable to have methods that can perform 
efficient tensor decomposition for such data, ideally with a 
computational-complexity that depends only on the num¬ 
ber of nonzeros (i.e., the ones) in the tensor, rather than 
the “volume” of the tensor. Motivated by this problem, 
we present a scalable Bayesian model for the Canonical 
PARA FAC (CP) tensor decomposition dKolda and Badei . 


2009), with an inference-complexity that scales linearly in 


the number of ones in the tensor. Our model uses a zero- 
truncated Poisson likelihood for each binary observation in 
the tensor; this obviates the evaluation of the likelihoods for 
the zero entries. At the same time, the significant speed-up 
is not at the cost of sacrificing on the quality of the solu¬ 
tion. As our experimental results show, the proposed like- 













































lihood model yields comparable or better results to logistic 
likelihood based models, while being an order of magni¬ 
tude faster in its running-time on real-world binary tensors. 
Note that replacing the zero-truncated Poisson by the stan¬ 
dard Poisson makes our model also readily applicable for 
count-valued tensors (Chi and Kolda, 2012); although, in 
this exposition, we will focus exclusively on binary tensors. 

Often, side-information ( Acar et all 2011 : Beu tel et al. 
20141) . e.g., pairwise relationships (partially/fully ob 


served), may also be available for objects in some of the 
tensor dimensions. For example, in addition to a binary 
tensor representing AUTHORS x WORDS x VENUES asso¬ 
ciations, the AUTHOR x AUTHOR co-authorship network 
may be available (at least for some pairs of authors). Such 
a network may be especially useful in “cold-start” settings 
where there is no data for some of the entities of a mode 
in the tensor (e.g., for some authors, there is no data in the 
tensor), but a network between entities in that mode may be 
available (See FigUJfor an illustration). Our model allows 
leveraging such network)s), without a significant compu¬ 
tational overhead, using the zero-truncated Poisson likeli¬ 
hood also to model these binary pairwise relationships. 


To facilitate efficient fully Bayesian inference, we develop 
easy-to-implement batch as well as online MCMC infer¬ 
ence; the latter is especially appealing for handling dense 
binary data, i.e., when the number of ones in the tensor 
and/or the network is also massive. Another appealing as¬ 
pect about the model is its interpretability; a Dirichlet prior 
on the columns of each factor matrix naturally imposes 
non-negativity. In addition, the rank of decomposition can 
be inferred from the data. 


2 CANONICAL PARAFAC (CP) TENSOR 
DECOMPOSITION 


The Canonical PARAFA C (CP) decomposi¬ 
tion ( Kolda and Bader . 2009 ) offers a way to express 
a tensor as a sum of rank-1 tensors. Each rank-1 tensor 
corresponds to a specific “factor” in the data. More 
specifically, the goal in CP decomposition is to decompose 
a tensor y of size ni x n 2 x ■ ■ ■ x with denoting 
the size of y along the k th mode (or “way”) of the tensor, 
into a set of K factor matrices U^ 1 ),...,where 


U (fe ) = [u 


(k) 

i 


44 


k = {1 ,K}, denotes the 


rik x R factor matrix associated with mode k. 


In its most general form, CP decomposition expresses the 
tensor y via a weighted sum of R rank-1 tensors as 

R 

y ~ 0 - 0 4 ) ) (!) 

r= 1 

In the above, the form of the link-function / depends on 
the type of data being modeled (e.g., / can be Gaussian for 
real-valued, Bernoulli-logistic for binary-valued, Poisson 
for count-valued tensors). Here X r is the weight associated 


with the r th rank-1 component, the rife x 1 column vec¬ 
tor Uj k j represents the r th latent factor of mode k, and © 
denotes vector outer product. 

We use subscript i = {ii,... ,ix} to denote the K- 
dimensional index of the i-th entry in the tensor y. Using 
this notation, the i -th entry of the tensor y can be written 

as Vi ~ /(£r=i K nf=i 


3 TRUNCATED POISSON TENSOR 

DECOMPOSITION FOR BINARY DATA 


Our focus in this paper is on developing a probabilistic, 
fully Bayesian method for scalable low-rank decomposi¬ 
tion of massive binary tensors. As opposed to tensor de- 
compositi on models based on the logistic li kelihood for bi¬ 
nary data ( Xu et al. . 20131 ; Rai et al. , 2014 ). which require 
evaluation of the likelihood for both ones as well as zeros 
in the tensor, and thus can be computationally infeasible 
to run on massive binary tensors, our proposed model only 
requires the likelihood evaluations on the nonzero (i.e., the 
ones) entries in the tensor, and can therefore easily scale 
to massive binary tensors. Our model is applicable to ten¬ 
sors of any order K > 2 (the case K = 2 being a binary 
matrix). 


Our model is based on a decomposition of the form given in 
Eq. Q] however, instead of using a Bernoulli-logistic link / 
to generate each binary observation m in y, we assume an 
additional layer (Eq. 13 which takes a latent count-valued 
Hi in y and thresholds this latent count at one to gener¬ 
ate the actual binary-valued entry bi in the observed binary 
tensor, which we will denote by B: 


bi 

= l(i/i>l) 

(2) 


R 


y 

~ Pois(^ A r-U^ 0 ... 0 u^) 

(3) 


~ Dir(a (fc) ,...,a (fc) ) 

(4) 

A r 

~ Gamma (g r , — —) 

1 - Pr 

(5) 

Pr 

~ Beta(ce, c(l — e)) 

(6) 


Marginalizing out yi from Eq. [2] leads to the following 
(equivalent) likelihood model 

R K 

bi ~ Bernoulli(l — exp(— E^INS)) (7) 

r=l fc=l 


Note that the thresholding in © looks similar to a probit 
model for binary data (which however thresholds a normal 
at zero)-, however, the probit model (just like the logistic 
model) also needs to evaluate the likelihood at the zeros, 
and can therefore be slow on massive binary data with lots 
of zeros. Likelihood models of the form (Eq.[7} have pre¬ 
viously also been consid ered in work on sta t istical model s 
of undirected networks ( Morup et al. . 201 li e Zhoul 2015 ). 




























Interestingly, the form of the likelihood in [7] also resem- 
bles the comple mentary log-log function IColletd (120021) : 
Piegorsch ( 19921) . which is known to be a better model for 
imbalanced binary data than the logistic or probit likeli¬ 
hood, making it ideal for handling sparse binary tensors. 

The conditional posterior of the latent count iji is given by 


K 


Vi\bi, A, {« 4 r}f =1 ~ h • Pois +(Y] K n u£l) (8) 


r —1 


fc= 1 


where Pois + (-) is zero truncated Poisson distribution. Eq. 
© suggests that if bi = 0, then yi = 0 almost surely 
with probability one, which can lead to significant compu¬ 
tational savings, if the tensor has a large number of zeros. 
In addition, our model also enables leveraging a reparame¬ 
terization (Section [372} of the Poisson distribution in terms 
of a multinomial, which allows us to obtain very simple 
Gibbs-sampling updates for the model parameters. 


(k) 

Note that the Dirichlet prior on the latent factors yP nat 
urally imposes non-negativity constraints ( Chi and Kolda, 


20121) on the factor matrices Tjl 1 ),...,U^. Moreover, 


since the columns of these factor matrices sums to 1, 
each u r can also be interpreted as a distribution (e.g., a 
“topic”) over the Uk entities in mode k. Furthermore, the 
gamma-beta hierarchical construction (Zhou et ah, 2012) 
of A r (Eq 0 and © allows inferring the rank of the ten¬ 
sor by setting an upper bound R on the number of factors 
and inferring the appropriate number of factors by shrink¬ 
ing the coefficients A r ’s to close to zero for the irrelevant 
factors. These aspects make our model interpretable as well 
as provide it the ability to do model selection (i.e., inferring 
the rank), in addition to being computationally efficient by 
focusing the computations only on the nonzero entries in 
the tensor B. 


3.1 LEVERAGING MODE NETWORKS 

Often, in addition to the binary tensor B, pairwise rela¬ 
tionships between entities in one or more tensor modes 
may be available in form of a symmetric binary network 
or an undirected graph. Leveraging such forms of side- 
information can be beneficial for tensor decomposition, es- 



start” settings, where there is no data in the tensor for enti¬ 
ties along some of the tensor mode(s), as shown in Fig© In 

the absence of any side-information, the posterior distribu- 

(k) 

tion of the latent factors itf ’ of such entities in that tensor 
mode would be the same as the prior (i.e., just a random 
draw). Leveraging the side-information (e.g., a network) 
helps avoid this. 

For entities of the fc-th mode of tensor B, we assume a sym¬ 
metric binary network AW <E {0,l}" feXTlfc , where A^] k 


Binary Network 
of Mode-1 Objects 



Completely 
Missing Part 


Figure 1: Binary tensor with an associated binary network be¬ 
tween objects in mode-1 of the tensor (in general, network for 
other modes may also be available). In the “cold-starf’setting as 
shown above, data along some of the tensor dimensions will be 
completely missing 

denotes the relationship between mode-fc entities ik and jk- 

Just like our tensor decomposition model, we model the 
mode-/.' network A ! , '2 as a weighted sum of rank-1 sym¬ 
metric matrices, with a similar likelihood model as we use 
for the tensor observations. In particular, we assume a la- 


tent count xj k ] 

for each binary entry A , and threshold 

(k) 

it at one to generate A\J- k 


A {k) 

Ikjk 

= i(4t >i) 

(9) 


R 


x( fc ) 

~ Pois(^ Pr-U^ © 
r= 1 

GO) 

Pr 

~ Gamma(/ r , ^ ^ ) 

(11) 

h r 

~ Beta(da,d(l — a)) 

(12) 


Note that since A (fc ) is symmetric, only the upper (or 
lower) triangular portion needs to be considered, and more¬ 
over, just like in the case of the tensor B, due to the 
truncated Poisson construction, the likelihood at only the 
nonzero entries needs to be evaluated for this part as well. 


3.2 REPARAMETERIZED POISSON DRAWS 

To simplify posterior inference (Section©, we make use 


of two re-parameterizations of a Poisson draw (Zhou et al 


2012). The first parameterization is to express each latent 
<k> as a sum of another set of R 
}^L 1 , respectively 


count variable yi and 
latent counts {yi r }^-\ and {X, 


Ikjkr] 


X . 


(fc) 


R 

Vi = ^ ' Vir i 
r =1 
R 


K 


= £■*< 


(fc) 

ikjkr’ 


X. 


Vir 


(fc) 


Pois(A r Y 


R k rJ 


«;.■;) (13) 


fc=i 


Pois(/3 r-u^lu^l) (14) 


r —1 


The second parameterization assumes that the latent counts 


{y ir } and X, 


(fc) 


are drawn from a multinomial 


Vi 1 ? • • • j ViR ^ Mult(?/i 5 Cil ? ■ • ■ ? C iR ) 


Cir 


\ rr^ (&) 

Ar IIfc=l U i k r 

R \ i“r K. (fc) 
J2r=l ilfc=l u i k r 


(15) 




































Y'(k) v'(^) 


M u!t(^ fc ; - - -, } 


ikjkR' 


J k ) 

^ikjkr 


f3 r li 


W (k) 
ikr a j k r 


2-/r=l 


/3 r u 


(k) (k) 
ikr a jkr 


(16) 


As we show in Section [4] these parameterizations en¬ 
able us to exploit the gamma-Poisson as well as the 
Dirichlet-multinomial conjugacy to derive simple, closed- 
form Gibbs sampling updates for the model parameters. 


4 INFERENCE 


Exact inference in the model is intractable and we resort 


to Markov Chain Monte Carlo (MCMC) (A ndrie u et af 


20031) inference. In particular, the reparameterization dis¬ 


cussed in Sec tion [T2l allows us to derive simple Gibbs sam¬ 
pling updates for all the latent variables, except for the la¬ 
tent counts yi, which are drawn from a truncated Poisson 
distribution via rejection sampling. As discussed earlier, 
the computational-cost for our inference method scales lin¬ 
early w.r.t. the number of ones in the tensor (plus the num¬ 
ber of nonzeros in the network, if side-information is used). 
This makes our method an order of magnitude faster than 
mod els based on logistic or p r obit l ikelihood for binary 
data ( Rai et ah . 2014 : Xu et ah . 2013 ). without sacrificing 
on the quality of the results. The relative speed-up depends 
on the ratio of total volume of the tensor to the number of 
ones, which is given by n fe)/ nnz (®); here nnz(£>) 

denotes the number of nonzeros in the tensor. 


In this section, we present both batch MCMC (Section l-kTl ) 
as well as an online MCMC (Section [4.2b method for in¬ 
ference in our model. The online MCMC algorithm is 
based on the idea of Bayesian Condit ional Density Fil¬ 
tering (CDF) (IGuhanivogi et all 120141) . and can lead to 
further speed-ups over the batch MCMC if the number 
of nonzeros in the tensor is also massive. The CDF 
algorithm provides an efficient way to perform online 
MCMC sampling using surroga te conditional sufficient 
statistics (IGuhanivogi et al.L 120141) . 


For both batch MCMC and CDF based online MCMC, 
we provide the update equations, with and without the 
side-information, i.e., the mode network(s). For what fol¬ 
lows, we define four quantities: = Yhi-i k =j Vir, s r = 

EiW." - E”‘ 41, and », - TS E”‘ 43„. 

which denote aggregates computed using the latent counts 
jji r and X^j kr . These quantities will be used at various 
places in the description of the inference algorithms that 
we present here. 


4.1 BATCH MCMC INFERENCE 

4.1.1 Tensor without Mode Network(s) 

Sampling z/^: For each observation bi in the tensor, the 
latent count yi is sampled as 


R K 

Vi ~ h ■ II '“S) ( 17) 

r =1 k—1 


where Pois + (-) is zero truncated Poisson distribution. Eq. 
(fTTI) suggests that if bi = 0, then yi = 0 almost surely; and 
if bi = 1, then y t ~ Pois + (Jjf =1 A r ]\k=\ u 5)- There¬ 
fore the y ^s only need to be sampled for the nonzero bi s. 
Sampling jji r : The latent counts { f/i, } are sampled from 
a multinomial as Eq. <ED>. Note that this also needs to be 

done only for the nonzero bi s. 

(k) 

Sampling u r : The columns of each factor matrix have a 
Dirichlet posterior, and are sampled as 


,(fc) 


■ DirtaW + sg.aW+sg,.. < 18 ) 


Sampling p r : Using the fact that s r = Y^iVi,r and 
marginalizing over the u^’s in o. we have s r ~ 
Pois(A r ). Using this, along with (0, we can express s r us¬ 
ing a negative binomial distribution, i.e., s r ~ NB(g r ,p r ). 
Due to the conjugacy between negative binomial and beta, 
we can then sample p r as 


p r ~ Beta(ce + s r , c(l - e) + g r ) (19) 

Sampling A r : Again using the fact that s r ~ Pois(A r ) and 
0, we have 

A r ~ Gamma(p r . + s r: p r ) (20) 

As can be observed, when updating u[ k \ p r and A r , the 
latent counts yi s and in, corresponding to zero entries in B 
are all equal to zero, and have no contribution to sufficient 
statistics sj r and s r . Therefore, only the nonzero entries 
in tensor need to be considered in the computations. 


4.1.2 Tensor with Mode Network(s) 

In the presence of mode network(s), the update equations 
for the latent variables p r , A r , yi r and yi, that are asso¬ 
ciated solely with the binary tensor B, remain unchanged, 
and can be sampled as described in Section l4.1.1l We how¬ 
ever need to sample the additional latent variables associ¬ 
ated with mode-/.' network A : L > , and the latent factors u < r > 
of mode-fc that are shared by the binary tensor B as well as 
the mode-/;: network. 

(k) (k) 

Sampling AQ : The latent counts are sampled as 

42. ~ 42. ■ r°iMX/s,<4‘X2> (21) 

r=l 

This only needs to be done for the nonzero entries in A (t K 

Sampling Xi k j kr : The latent counts Xi k j kr are sampled 
from a multinomial as equation (IT6l) . This also only needs 
to be done for the nonzero entries in A : '' ! . 




















Sampling uE 1 : The columns of each factor matrix have a 
Dirichlet posterior, and are sampled as 


,(*0 


' Dir(a 


(fc) . 


~l,r 


' Ul,i 


,a 


( fe ) i «( fe ) 


S n k ,r + W 


„r) 


( 22 ) 

Note that in the absence of the mode-fc network, the terms 
V j- go away and Eq.l22l simply reduces to Eq.fl8l 

Sampling h r : h r ~ Beta(da + v r , d( 1 — a) + f r ). 

Sampling f3 r : /3 r ~ Gamma (f r +v r ,h r ). 


4.1.3 Per-iteration time-complexity 


For the binary tensor B, computing each Q r (Eq. fiTl i takes 
O(K) time and therefore computing all the { Q /r } takes 
0(nnz{B)RK) time. Likewise, for the binary mode-fc 
network A <k '\ computing all the K^rl (Eq- ed takes 
0(niY/(A !k> )R) time. These are the most dominant com¬ 
putations in each iteration of our MCMC procedure; updat¬ 
ing each Ur ' 1 takes 0(rik) time and updating {p r , h r }Ei 
and {A r , /3 r }^ =1 takes 0(R) time each. Therefore, the per- 
iteration time-complexity of our batch MCMC method is 
(D(nnz(B)RK + nnz(A ^)R). The linear dependence on 
nnz(£>), nnz(Al fc )), R and K suggests that even massive, 
sparse binary tensors and mode network(s) can be han¬ 
dled easily even by our simple batch MCMC implemen¬ 
tation. Also note that our model scales linearly even w.r.t. 
R, unlike most other methods ( Ermis and Bouchard 2014 
Rai et ah. 2014 ) that have quadratic dependence on R. 


needs to be done for the nonzero bd s. 


Sampling j/j r : For all i £ I t , sample the latent counts 
Vir{ieh) usin g 03. again only for the nonzero bi s. 

Updating the conditional sufficient statistics: Update the 
conditional sufficient statistics sE as = sE* + 

Eiei t -.i k =j Vir and update s r as sE = s^ _1) +Eiei t Vi,r- 
These updates basically add to the old sufficient statistics, 
the contributions from the data in the current minibatch. In 
practice, we also reweight these sufficient statistics by the 
ratio of the total number of ones in B and the minibatch 
size, so that they represent the average statistics over the 
entire tensor. This reweighting is akin to the way average 
gradients are compu ted in stochastic variational inference 
methods ( Hoffman et al. . 20131) . 


(k) 

Updating v,}- . p r , A,: Using the following conditionals, 
draw M samples {ui k ’ m \pr n \ Ar m ^}^f = 


=i 


Pr 

A r 


Dir(a 


(fc) 


( k,t ) 

H,r ,■■■■, a 

,(*> 


(fc) 


4tr) (23) 


Beta(ce + 8®, c(l — e) + g r ) (24) 
Gamma(5 r + s^,j) r ) (25) 


(k) 

and either store the sample averages of u r , p r , and A r , 
or t heir analytic means to u se for the next CDF itera¬ 
tion (Guhaniyogi et ah, 20141) . 


4.2.2 Tensor with Mode Network(s) 


The above computations can be further accelerated us¬ 
ing a distributed/multi-core setting; we leave this for fu¬ 
ture work. In Section l4~2l however, we present an online 
MCMC method b ased on the idea of Baye sian Conditional 
Density Filtering (IGuhanivogi et all 120141) . which leads to 
further speed-ups, even in single-machine settings. 


4.2 ONLINE MCMC INFERENCE 


We develop an efficient online MCMC sampler for the 
model, lever aging ideas from the Con ditional Density Fil¬ 
tering (CDF) Guhaniyogi et al. ( 2014 ). The CDF algorithm 
for our model selects a minibatch of the tensor (and mode 
network, if the side-information is available) entries at each 
iteration, samples the model parameters from the posterior. 


(k) 

and updates the sufficient statistics s)- r , s r , Vi ktT 
using the data from the current minibatch. 


and v r 


4.2.1 Tensor without Mode Network(s) 

We first provide the update equations for the case when 
there is no side-information (mode network). Denote I t as 
indices of entries of tensor B from the minibatch selected 
at iteration t. The CDF algorithm at iteration t proceeds as: 

Sampling yp. For all i £ It, sample tji according to equa¬ 
tion O; like in the batch MCMC case, the sampling only 


For all the latent variables associated solely with the ten¬ 
sor B, the sampling equations for the CDF algorithm in the 
presence of mode network(s) remain unchanged as the pre¬ 
vious case with no network. In the presence of the mode 
network, the additional latent variables include the suffi¬ 
cient statistics Vi k>r and v r , and these need to be updated in 
each CDF iteration. 


Denote J t as indices of entries selected from the mode-fc 
network ,A' k> in iteration t. The update equations for the 
latent variables that depend on A 1 '"^ are as follows: 

Sampling X ikjk : For (■ i k ,j k ) 6 Jt, latent count X ikjk is 
sampled using Eq. (1211 . 

Sampling X ikjkr : For ( i k ,j k ) £ Jt, latent counts X lkhr 
are sampled from a multinomial using Eq. (IT6t . 


Updating the conditional sufficient statistics: Update the 
sufficient statistics associated with the mode-fc network as 


(i) _ (t-i) 

^i k ,r ^i k ,r 


X^ nk Y 

" ikjkr 


and Vr^ = ^ - 


E? k Ej k ,(ikJk)eJt X i kjk r. Just like the way we update 


(k) 

the tensor sufficient statistics r and s r , we reweight these 
mode-fc sufficient statistics by the ratio of the total number 
of ones in A' k> and the minibatch size, so that they repre¬ 
sent the average statistics over the entire mode-fc network. 


(k) 

Updating u r , h r , [i r : 


Using the following condition- 

























als, draw M samples {ul k ’ m \ h^\ /3r T ”' l }m=i- We draw 
wf* ~ Dir(a (fc) + s[ k ^ + v ^ r ,..., a (fe) + Sn^J + 1 ’nl,r), 
and h r and p r as 

h r ~ Beta(da + v^\ d(l — a) + f r ) 

P r ~ Gamma(/ r + t)^,/i r ) (26) 

and either store the sample averages of u[ k \ h r , /3 r , or their 
analytic means to use for the next CDF iteration. 

4.2.3 Per-iteration time-complexity 

The per-iteration time-complexity of the CDF based online 
MCMC is linear in the number of nonzeros in each mini¬ 
batch (as opposed to the batch MCMC where it depends on 
the number of nonzeros in the entire tensor and network). 
Therefore the online MCMC is attractive for dense binary 
data, where the number of nonzeros in the tensor/network 
is also massive; using a big-enough minibatch size (that fits 
in the main memory and/or can be processed in each iter¬ 
ation in a reasonable amount of time), the online MCMC 
inference allows applying our model on such dense binary 
data as well, which may potentially have several billions of 
nonzero entries. 

5 RELATED WORK 

With the increasing prevalence of structured databases, so¬ 
cial networks, and (multi)relational data, tensor decompo¬ 
sition methods are becoming increasingly popular for ex- 
tracti ng knowledge and doing predictive analytics on such 
data ( Bordes et al. . 201 it Nickel etldl 2012 : Kang et al. . 
2012 ). As the size of these data sets continues to grow. 


there has been a pressing need to design tensor factoriza¬ 
tion methods that can scale to massive tensor data. 

For low-rank factorization of binary tensors, methods 
based on logistic and probit likelihood for t he binary data 


2013: Rai et al.. 2014: Xu et al 


have been proposed ( Jenatton et all |2012c iLondon et al 


201 tIT 


However, these 
methods are not suited for massive binary tensors where 
the number of observations (which mostly consist of zeros, 
if the tens or is also sparse) could easily be millions or even 


billions (Inah et al 


120151). As a heur istic, these methods 
rely on subsamp ling (Rai et al. , 2014 ) or partitioning the 
tensor ( Zhe et al .. 20151) . to select a manageable number 


entries before performing the tensor dec ompositi on, or al¬ 
ternatively going for a distributed setting ( Zhe et all 2013 ). 


In the context of tensor factorization, to the best of 
our knowledge, the only method (and one that is clos¬ 
est in spirit to our work) that scales linearly w.r.t. the 
number of ones in the tensor is dErmis and Bouchard , 
2014 ). Their work explored quadratic loss (and its vari¬ 
ations) as a surrogate to the logistic loss and proposed a 
method (Quad-Approx) with a per-iteration complexity 
0(m\z(B)R 4- R 2 n k)- Note that its dependence on 


R is quadratic as opposed to our method which is also lin¬ 
ear in R. They also proposed variations based on piece- 
wise quadratic app roximati ons: however, as reported in 
their experiments ( Ermis and Bouchard . 2014 ), these vari¬ 
ations were found to be about twice as slow than their ba¬ 


sic Quad-Approx method (Ermis and Bouchard, 12014). 
Moreover, their methods (and the various other methods 
discussed in this section) have several other key differences 
from our proposed model: (1) our model naturally imposes 
non-negativity on the factor matrices; (2) R can be inferred 
from data; (3) our method provides a fully Bayesian treat¬ 
ment; (4) in contrast to their method, which operates in 
a batch setting, the online MCMC inference allows our 
model to scale to even bigger problems, where the number 
of nonzeros could also be massive; and (5) our model also 
allows incorporating (fully or partially observed) mode- 
networks as a rich source of side-information. 


In another recent work (Zhou, 2015[), a similar zero- 
truncated Poisson construction, as ours, was proposed for 
edge-partioning based network clustering, allowing the 
proposed model to scale in terms of the number of edges in 
the network. Our model, on the other hand, is more general 
and can be applied to multiway binary tensor data, with an 
optionally available binary network as a potential source of 
side-information. Moreover, the Dirichlet prior on the fac¬ 
tor matrices, its reparametrizations (Section [3.21) . and the 
online MCMC inference lead to a highly scalable frame¬ 
work for tensor decomposition with side-information. 

Another line of work on scaling up tensor factorization 
met hods involves developing distributed and parallel meth 


ods (Kan g et all 2012t Inah et al. . 2015 : Papalexakis et al 


2012 : Beutel et all 2014lf Most of these methods, how 

ever, have one or more of the following limitations: (1) 
these methods lack a proper generative model of the data, 
which is simply assumed to be real-valued and the opti¬ 
mization objective is based on minimizing the Frobenius 
norm of the tensor reconstruction error, which may not be 
suitable for binary data; (2) these methods usually assume 
a parallel or distributed setting, and therefore are not feasi¬ 
ble to run on a single machine; (3) missing data cannot be 
easily handled/predicted; and (4) the rank of the decompo¬ 
sition needs to be chosen via cross-validation. 

Leveraging sources of side-information for tensor factor¬ 
ization has also been gaining a lot of attention recently. 
However, mo st of these method s cannot sc a le easi ly to mas¬ 
sive tensors ( Acar et al. . 2011 : Rai et al. . 2015 ). or have 
to rely on parallel or distributed computing infrastruc¬ 


tures ( Beutel et all 20141) . In contrast, our model, by the 


virtue of its scalability that only depends on the number 
of nonzero entries in the tensor and/or the mode network, 
easily allows it to scale to massive binary tensors, with or 
without mode-network based side-information. 






















































































































6 EXPERIMENTS 

We report experimental results for our model on a wide 
range of real-world binary tensors (with and without mode- 
network based side-information), and compare it with sev¬ 
eral baselines for binary tensor factorization. We use the 
following data sets for our experiments: 


• Kinship: This is a binary tensor of size 104 x 104 x 
26, representing 26 t ypes o f relationships between 104 


members of a tribe (Nickel et ah, 120111) . The tensor 
has about 3.8% nonzeros. 


• UMLS: This is a binary tensor of size 135 x 135 x 49 
representing 56 types_of verb relations between 135 
high-level concepts (INickel et all 201 lj). The tensor 


has about 0.8% nonzeros. 


• Movielens: This is a binary matrix (two-way tensor) 
of size 943 x 1682 representing the binary ratings 
(thumbs-up or thumbs-down) by 943 users on 1682 
moviesQ. This data set has a total of 100,000 ones. 


DBLP: This is a binary tensor of size 10, 000 x 200 x 
10,000 representing (au thor-conference-keyword) re¬ 
lations ( Zhe et ah ; 20151) . This tensor has only about 
0.001% nonzeros, and is an ideal example of a mas¬ 
sive but sparse binary tensor. 


• Scholars: This is a binary tensor of size 2370 x 8663 x 
4066, constructed from a database of research paper 
abstracts published by researchers at Duke University; 
the three tensor modes correspond to authors, words, 
and publication venues, respectively. Just like the 
DBLP data, this tensor is also massive but extremely 
sparse with only about 0.002% nonzeros. In addition, 
the co-authorship network (i.e., who has written pa¬ 
pers with whom) is also available, which we use as 
a source of side-information, and use this network to 
experiment with the cold-start setting (i.e., when the 
main tensor has no information about some authors). 


• Facebook: The Facebook data is a binary ten¬ 
sor of size 63731 x 63730 x 1837 with the 
three modes representing wa ll-owner, poster, and 
days (IPapalexakis et all 120131) . This tensor has only 
737498 nonzeros. In addition to the binary tensor, the 
social network (friendship-links) between users is also 
given in form of a symmetric binary matrix of size 
63731 x 63731, which has 1634180 nonzeros. We use 
the network to experiment with the cold-start setting. 


We use all the 6 data sets for the tensor completion ex¬ 
periments (Section 16.11 ). We also use the Scholars and 
Faceboook data in the cold-start setting, where we exper¬ 
iment on the tensor completion task, leveraging the mode- 
network based side-information (Section |6.4 [ i. 

’http://grouplens.org/datasets/movielens/ 


The set of experiments we perform includes: (1) binary 
tensor completion (Section [6Tl) using only the tensor data; 

(2) scalability behavior of our model (both batch as well 
as online MCMC) in terms of tensor completion accuracy 
vs run-time (Section |6.2b : we compare our m odel with 
Bayesian CP based on logistic-likelihood ( Rai et al. . 20141) : 

(3) a qualitative analysis of our results using a multiway 
topic modeling experiment (Section 16.31 ) on the Scholars 
data, with the entities being authors, words, and publica¬ 
tion venues; and (4) leveraging the mode network for ten¬ 
sor completion in the cold-start setting (Section 16.41) : for 
this experiment, we also demonstrate how leveraging the 
network leads to improved qualitative results in the multi¬ 
way topic modeling problem. 

In the experiments, we refer to our model as ZTP- 
CP (Zero-Truncated Poisson based CP decomposition). 
We compare ZTP-CP (using both batch MCMC as well 
as online MCMC inference) with the following base¬ 
lines: (1) the quad ratic lo ss minimizati on (Quad-App) 
proposed in ( Ermis and Bouchard . 20141) : (2) the re¬ 
fined piecew ise qu adrat ic app rox imatio n algorithm (PW- 
QuadApp) dErmis and Bouchard! 2014 ): and (3) Bayesian 
CP de composi tion based on logistic likelihood for binary 
data ( Rai et al. . 2014 ). 


Experimental settings: All experiments are done on a 
standard desktop computer with Intel i7 3.4GHz processor 
and 24GB RAM. Unless specified otherwise, the MCMC 
inference was run for 1000 iterations with 500 burn-in it¬ 
erations. The online MCMC algorithm was also run for 
the same number of iterations, with minibatch size equal to 
one-tenth of the number of nonzeros in the training data. 
For all the data sets, except Scholars and Facebook, we use 
R = 20 (also note that our model has the ability to prune 
the unnecessary factors by shrinking the corresponding X r 
to zero). For Scholars and Facebook data, we set R = 100. 
The hyperparameters g r , f r were set to 0.1, and e and a are 
set to 1/R, which worked well in our experiments. 

6.1 TENSOR COMPLETION 


In Table [T] we report the results on the tensor completion 
task (in terms of the AUC-ROC - the area under the ROC 
curve). For this experiment, although available, we do not 
use the mode network for the Scholars and the Facebook 
data; only the binary tensor is used (the results when also 
using the network are reported in Section 16.4b . For each 
data set, we randomly select 90% of the tensor observa¬ 
tions as the training data and evaluate each model on the 
remaining 10% observations used as the held-out data. 


Since the code for Quad-App and PW-QuadApp base¬ 
lines (both proposed in (Ermis and Bouchard. 20141) 1 is not 
publicly available, we are only able to report the results 
for the Kinship, UMLS, and MovieLe ns da ta set (using 
the results re ported in (lErmis and Bouchardl 120141) 1. For 
Bayesian CP ( Rai et al.. 120141) . we use the code provided 




















































Table 1: Tensor completion accuracies in terms of AUC-ROC scores. Results are averaged over 10 splits of training and test data. Note: 
( 1 ) Bayesian CP was infeasible to run on the Scholars and Facebook data; (2) Due to the lack of publicly availab le code for Q uad-App 
and PQ-QuadApp, we only report its results on Kinship, UMLS, and MovieLens data (results taken from lErmis and Bouchard . 120141 )). 



Kinship 

UMLS 

Movielens 

DBLP 

Scholars 

Facebook 

Quad-App (Ermis and Bouchard, 2014) 

0.8193 

0.8205 

0.8511 

- 

- 

- 

PW-QuadApp (Ermis and Bouchard, 2014) 

0.9213 

0.9387 

0.9490 

- 

- 

- 

Bayesian-Logistic-CP (Rai et al., 2014) 

0.9865 

0.9965 

0.9799 

0.9307 

- 

- 

ZTP-CP (Batch MCMC) 

0.9674 

0.9938 

0.9895 

0.9759 

0.9959 

0.9830 

ZTP-CP (Online MCMC) 

0.9628 

0.9936 

0.9841 

0.9743 

0.9958 

0.9844 


by the authors. Moreover, the Bayesian CP baseline was 
found infeasible to run on the Scholars and Facebook data 
(both of which are massive tensors), so we are unable to 
report those results. For fairness, on Kinship, UMLS, and 
MovieLens data, we use the same exp erimental settings for 
all the methods as used by ( Ermis and Bouchardl 20141) . 


As shown in Table Q] our model outperforms Quad-App 
and PW-QuadApp in terms of the tensor-completion ac¬ 
curacies, and performs comparably or better than Bayesian 
CP, while being an order of magnitude faster (Section 16721 
shows the results on running times). 

6.2 SCALABILITY 


We next compare our model with Bayesian CP (Rai et al 


20141) in terms of the running times vs tensor comple¬ 
tion accuracy on Kinship and UMLS data sets. As shown 
in Fig. [2] (top-row), our model (batch as well as online 
MCMC) runs/converges an order of magnitude faster than 
Bayesian CP in terms of running time. On Scholars and 
Facebook, since Bayesian CP was infeasible to run, we are 
only able to show the results (Fig. [2] bottom-row) for our 
model, with batch MCMC and online MCMC inference. 
On all the data sets, the online MCMC runs/converges 
faster than the batch MCMC. 

We would like to_note that, although the model proposed 
in ( Ermis and Bouchard . 20141) also scales linearly @ in 
the number of ones in the tensor, the per-iteration time- 
complexity of our model, which is linear in both nnz(£>) 
as well a_s rank R, is better than the model proposed 
in (Ermis and Bouchard, 2014) (which has quadratic de¬ 
pendence on R). Moreover, the tensor completion results 
of our model (shown in TablcfT} on these data se ts are better 
than the ones reported in ( Ermis and Bouchard , 20141) . 


6.3 MULTIWAY TOPIC MODELING 

We also apply our model for a multiway topic modeling 
task on the Scholars data. The binary tensor represents 
AUTHORS x WORDS x VENUES relationships. We apply 
our model (with batch MCMC) and examine the latent fac¬ 
tors of each of the three dimensions. Since each factor is 
drawn from a Dirichlet, it is non-negative and naturally cor- 

2 Although dErmis and Bouchardl . 1 20141) reported run times on 
Kinship and UMLS data sets, those number are not directly com¬ 
parable with our run times reported here (due to possibly different 
machine configuration, which they do not specify in the paper). 


Kinship Data UMLS Data 




Figure 2: Running time (log-scale) comparison of various meth¬ 
ods on Kinship (top left), UMLS (top right), Scholars (bottom 
left), and Facebook (bottom right) datasets. 

responds to a “topic”. In Table[2] after examining the words 
factor matrix, we show the top-10 words for four of the fac¬ 
tors (topics) inferred by our model; these factors seem to 
represent topics Evolutionary Biology, Medical Imaging, 
Machine Learning/Signal Processing, and Oncology. For 
the Machine Learning/Signal Processing topic, we also ex¬ 
amine the corresponding topic in the venues factor matrix 
and show the top-10 venues in that topic (based on their 
factor scores in that factor). In Fig. [3] we also show the 
histograms of authors’ department affiliations for each of 
the four topics and the results make intuitive sense. The re¬ 
sults in Tableland Fig.[3]demonstrate the usefulness of our 
model for scalable topic modeling of such multiway data. 

6.4 LEVERAGING THE MODE NETWORK 

Finally, to investigate the usefulness of leveraging the mode 
network, we experiment with using both the tensor and the 
mode network on Scholars and Facebook data sets. For 
each data set, we report the AUC-ROC (area under the ROC 
curve) and AUC-PR (area under the precision-recall curve) 
on the tensor completion task, with and without network. 
For both data sets, we experiment with the more challeng¬ 
ing cold-start setting. In particular, for the Facebook data, 
we hold out all the entries of the tensor slices after the first 
50,000 wall-owners and predict those entriesfusing only 
the rest of the tensor, and using the rest of the tensor as 
well as the friendship network). We run the experiment 
with R. = 20 and minibatch size of 50,000 for the online 






















































Table 2: For the Scholars data, the most probable words in topics related to evolutionary biology (Evo Bio), medical imaging (Med 
Imag), machine learning/signal processing(ML/SP) and oncology, and top ranked venues in ML/SP 


EVO BIO 

Med Imag 

ML/SP 

Oncology 

Top Venues in ML/SP 

SPECIES 

IMAGING 

BAYESIAN 

RADIATION 

ICASSP 

SELECTION 

CONTRAST 

ALGORITHM 

RADIOTHERAPY 

JASA 

GENETIC 

COMPUTED 

SAMPLING 

STAGE 

ICML 

EVOLUTION 

RESONANCE 

FEATURES 

TUMOR 

IEEE TRANS IMG PROC 

POPULATIONS 

DOSE 

PROCESS 

SURVIVAL 

NIPS 

EVOLUTIONARY 

TOMOGRAPHY 

SPARSE 

LUNG 

COMPU STAT DATA ANALY 

GENE 

MAGNETIC 

NONPARAMETRIC 

CHEMOTHERAPY 

Biometrics 

VARIATION 

IMAGE 

GIBBS 

TREATED 

Bayesian analysis 

PLANTS 

QUALITY 

PARAMETERS 

TOXICITY 

JMLR 

NATURAL 

DIAGNOSTIC 

INFERENCE 

ONCOLOGY 

IEEE TRANS. INF. THEORY 



Figure 3: Histogram of the department-affiliations for the top 
20 authors in factors related to evolutionary biology (top left), 
medical imaging (top right), machine learning/signal process- 
ing(bottom left) and oncology (bottom right). 


MCMC. The results in Table[3]show that using the network 
leads to better tensor completion accuracies. 

We also perform a similar experiment on the Scholars data 
where we hold out all the entries in tensor slices after the 
first 1000 authors and predict those entries (using only the 
rest of the tensor, and using the rest of the tensor as well as 
the co-authorship network). We run the experiment with 
R = 100 and minibatch size of 50,000 for the online 
MCMC. The results shown in Table [T] again demonstrate 
the benefit of using the network. 


Table 3: Cold-start setting 



Facebook 

Scholars 


AUC-ROC 

AUC-PR 

AUC-ROC 

AUC-PR 

Without network 

0.8897 

0.6076 

0.8051 

0.5763 

With network 

0.9075 

0.7255 

0.8124 

0.6450 


In Fig. 0] we show another result demonstrating the benefit 
of using the co-authorship network for the Scholars data. 
Note that in the cold-start setting, there is no information 
in the tensor for the held-out authors. Therefore the top¬ 
ics associated with such authors are expected to be roughly 
uniformly random. As shown in Fig.0](left column), the set 
of held-out authors assigned to the topics medical imaging 
and oncology seem very random and arbitrary (we only 
show the aggregate department-affiliations). Using side- 
information (in form of the co-authorship network), how¬ 
ever, the model sensibly assigns authors who are indeed 
related to these topics, as shown in right column of Fig. [4] 



Figure 4: Histogram of the department-affiliations of the top 15 
held-out authors associated with the factors of medical imaging 
(top) and oncology (bottom). The left column is obtained using 
no co-authorship information, and the right column is obtained 
using co-authorship information. 


7 CONCLUSION 

We have presented a scalable Bayesian model for binary 
tensor factorization. In contrast to the models based on pro¬ 
bit or logistic likelihood for binary tensor decomposition, 
the time-complexity of our model depends only in the num¬ 
ber of ones in the tensor. This aspect of our model allows 
it to easily scale up to massive binary tensors. The simplic¬ 
ity of our model also leads to simple batch as well as on¬ 
line MCMC inference; the latter allows our model to scale 
up even when the number of ones could be massive. Our 
experimental results demonstrate that the model leads to 
speed-ups of an order of magnitude when compared to bi¬ 
nary tensor factorization models based on the logistic like¬ 
lihood, and also outperforms various other baselines. Our 
model also gives interpretable results which helps qualita¬ 
tive analysis of results. In addition, the ability to leverage 
mode networks (fully or partially observed) leads to im¬ 
proved tensor decomposition in cold-start problems. 
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